-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
General Druid refactors #16708
General Druid refactors #16708
Conversation
processing/src/main/java/org/apache/druid/segment/data/FrontCodedIntArrayIndexedWriter.java
Dismissed
Show dismissed
Hide dismissed
processing/src/main/java/org/apache/druid/segment/data/FixedIndexedWriter.java
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/segment/data/FixedIndexedWriter.java
Show resolved
Hide resolved
...ssing/src/main/java/org/apache/druid/segment/loading/MMappedQueryableSegmentizerFactory.java
Outdated
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/segment/data/VByte.java
Outdated
Show resolved
Hide resolved
@@ -91,7 +91,7 @@ public int numRows() | |||
public boolean isNull(int rowNum) | |||
{ | |||
offset.set(rowNum); | |||
return valueSelector.isNull(); | |||
return valueSelector.getObject() == null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this have a different contract than BaseNullableColumnValueSelector
? Its isNull
method means that getLong
/getFloat
/getDouble
will return a null value, not necessarily that getObject
is null. E.g. sometimes non-numeric selectors will implement these primitive getters, but not necessarily be null rows.
Looking at some of the things that implement it, it doesn't appear to be different, e.g. ObjectColumnAccessorBase
has an implementation of getLong
that checks if the instance is a String
, and if so, tries to parse it. This means that getObject
will not return null, but getLong
isn't valid, so there is no way to determine that the 0
it returns is actually a null
because we couldn't parse the string 'hello'
into a long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The contract here is supposed to be abandoning the "numerical null" thing that the ValueSelectors follow. It is explicitly "is it null" rather than "will it be null if cast to a primitive type"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how do you know if it is safe to call getLong
if that is the contract?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe we should eliminate implicit casting in the numeric getter methods to make it so that they only work if the column type really is a number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the javadoc for ColumnAccessor (Get whether the value of a row is null
), the updated behaviour would be correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we need to revisit this contract, since it doesn't really feel compatible to coexist with implicit casting on the getLong
/getFloat
etc methods. If you have a string column but want to read numbers from it but not all of them can be parsed into numbers, you can't reliably use any of the primitive getters, because isNull
would say false, but getLong
would not be parseable into a number and return 0. See ObjectColumnAccessorBase
.
This means you can't actually use these methods if you want to try to get numbers from an object column, so they should probably instead throw exceptions instead of silently giving you zeros that should have been handled as nulls.
We can fix this by just forbidding this sort of implicit casting, I'm not sure it does us any favors in terms of making code clear to follow, while having explicit casting wrapper gizmos would make it obvious when something is turning a string into a number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not that it has to be done now, but it does seem problematic and is why ColumnValueSelector.isNull
has the strange contract it does, because you kind of need it if you allow implicit casting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The semantics as defined here is that a non-parseable String, when read from getLong
is 0
instead of null
. If something wants that to be null
, it needs to do its own parsing/handling. It is entirely possible to, if you have a String object, use the getObject()
form and handle it yourself.
We could make this logic be generically accessible by having a LongAccessor
interface and ask the column to become one of those when it wants to force-cast to Long
. What I'm trying to say is that I understand the semantics you are looking for and I think that it's totally valid, I don't think it has to be delivered from this set of getters
which just exist to provide primitive values in the simplest possible way. That said, perhaps if we had the type-specific accessors we would end up not needing these generic get
methods and then life would just be better. Either way, it's something that can be adjusted in another PR.
processing/src/main/java/org/apache/druid/query/rowsandcols/column/LongArrayColumn.java
Outdated
Show resolved
Hide resolved
return FindResult.found(startIndex, end); | ||
} | ||
|
||
int i = Arrays.binarySearch(vals, startIndex, endIndex, val); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this isn't really new since the others do this too, but is there any guarantee that vals
is sorted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These methods are part of the BinarySearchableAccessor
, which is fundamentally just saying "if you want to binary search and you know it's safe, use these methods". Instead of making the accessor itself implement this, it probably would've been better to make it another interface off of the Column
. That's my bad really.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fwiw, the whole BinarySearchableAccessor
is actually only used by an experimental join implementation that exists to exercise the operator stuff. Your commentary about the oddities are totally valid and belong against that. It's not really a solidified part of the code base though, so we should perhaps add some javadocs to the interface itself, talk a bit about why it's awkward and when/if we get back to it, we can improve it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The worst-case time complexity of findLong
is O(n) if all the values from startIndex
to endIndex
are the same. It can be improved by doing two binary searches to find the upper and the lower bound. The current implementation is the same in other places. Is it worth switching to the proposed one instead?
If the array always has mostly distinct values, then the current implementation would work better on average.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, none of the "find" stuff is actually being used for anything meaningful. Ignore it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if it's more likely that we have distinct values here, but currently, this is in line with the other type columns. If we make a call on using multiple searches instead, that should be consistent with the other columns here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adarshsanjeev Lets java doc the oddity for now and move forward? We can take up this work as part of a follow up refactor ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added this to the javadoc in BinarySearchableAccessor
@Override | ||
public FindResult findComplex(int startIndex, int endIndex, Object val) | ||
{ | ||
return findLong(startIndex, endIndex, Numbers.tryParseLong(val, 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i know this isn't really new, but why doesn't this need to distinguish between the case where val
is parseable into a number and not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The meta answer to your question is because it's not an interface that's used for anything actually meaningful yet.
The actual answer to your question is that the one place that is trying to use this is first checking for null
and then calling findNull
It then does a switch on the searcher's type and calls the relevant method for that. The type of the LongArrayColumn
is LONG
so it won't actually call findComplex()
and will call findLong()
instead. This is fundamentally putting the responsibility on the call site to "do the right thing". Is that the correct behavior or not? I dunno, but it's an experimental interface that you would have to be incredibly intrepid to actually hit, so it's a detail that perhaps doesn't matter.
If your suggestion is to switch this to a NotYetImplemented
exception, I'd probably agree with that suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the numberColumns to throw an exception instead of using a 0 if the string is unparsable for findComplex
and findString
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM. Make sure you have green checks and that clint is satisfied before merge.
return FindResult.found(startIndex, end); | ||
} | ||
|
||
int i = Arrays.binarySearch(vals, startIndex, endIndex, val); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These methods are part of the BinarySearchableAccessor
, which is fundamentally just saying "if you want to binary search and you know it's safe, use these methods". Instead of making the accessor itself implement this, it probably would've been better to make it another interface off of the Column
. That's my bad really.
processing/src/main/java/org/apache/druid/query/rowsandcols/column/LongArrayColumn.java
Outdated
Show resolved
Hide resolved
return FindResult.found(startIndex, end); | ||
} | ||
|
||
int i = Arrays.binarySearch(vals, startIndex, endIndex, val); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fwiw, the whole BinarySearchableAccessor
is actually only used by an experimental join implementation that exists to exercise the operator stuff. Your commentary about the oddities are totally valid and belong against that. It's not really a solidified part of the code base though, so we should perhaps add some javadocs to the interface itself, talk a bit about why it's awkward and when/if we get back to it, we can improve it.
@@ -91,7 +91,7 @@ public int numRows() | |||
public boolean isNull(int rowNum) | |||
{ | |||
offset.set(rowNum); | |||
return valueSelector.isNull(); | |||
return valueSelector.getObject() == null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The contract here is supposed to be abandoning the "numerical null" thing that the ValueSelectors follow. It is explicitly "is it null" rather than "will it be null if cast to a primitive type"
processing/src/main/java/org/apache/druid/segment/data/FixedIndexedWriter.java
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/segment/data/FixedIndexedWriter.java
Show resolved
Hide resolved
processing/src/main/java/org/apache/druid/query/rowsandcols/column/LongArrayColumn.java
Outdated
Show resolved
Hide resolved
return FindResult.found(startIndex, end); | ||
} | ||
|
||
int i = Arrays.binarySearch(vals, startIndex, endIndex, val); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The worst-case time complexity of findLong
is O(n) if all the values from startIndex
to endIndex
are the same. It can be improved by doing two binary searches to find the upper and the lower bound. The current implementation is the same in other places. Is it worth switching to the proposed one instead?
If the array always has mostly distinct values, then the current implementation would work better on average.
*/ | ||
public class NotYetImplemented extends DruidException.Failure | ||
{ | ||
public static DruidException ex(String msg, Object... args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to not have factory methods which omit the cause exception
I've seen places where there was a catch
block - but the newly created exception did not had the cause
filled in most likely by mistake - but as a result the stacktrace was clipped.
I think forcing the api user to write out null
for cause makes this mistake less probable to do the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
++firstNonDruidExceptionIndex; | ||
} | ||
if (firstNonDruidExceptionIndex < stackTrace.length) { | ||
retVal.setStackTrace(Arrays.copyOfRange(stackTrace, firstNonDruidExceptionIndex, stackTrace.length)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this conditional looks odd... its not wrong; but it will only skip the set if it went over the full stacktrace; but when its 0
it still does it...
I guess the intention here is to clip the exception construction process?
could you add a test for this?
with a case which checks what happens if a DruidException
is set as a cause for another DruidException
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed this to if (firstNonDruidExceptionIndex > 0)
If firstNonDruidExceptionIndex == stackTrace.length
, we should copy an empty array, which is what would happen, and if the the exception is 0, we should not need a copy.
processing/src/main/java/org/apache/druid/error/NotYetImplemented.java
Outdated
Show resolved
Hide resolved
Just looking at the comments, it looks like things have generally been addressed. Clint has an outstanding comment about the semantics of the ColumnAccessor interface, which are valid but also not the central theme of this PR. As such, I don't see a reason to delay the merge of these changes, so going to push the merge button. |
Some general refactors across Druid.
write
.This PR has: